R Setup

pacs <- c("plotly","tidyverse","ggrepel","fastDummies","knitr","kableExtra",
             "splines","reshape2","PerformanceAnalytics","correlation","see",
             "ggraph","psych","nortest","rgl","car","ggside","tidyquant","olsrr",
             "jtools","ggstance","magick","cowplot","emojifont","beepr","Rcpp",
             "equatiomatic")

options(rgl.debug = TRUE)

if(sum(as.numeric(!pacs %in% installed.packages())) != 0){
  instalador <- pacs[!pacs %in% installed.packages()]
  for(i in 1:length(instalador)) {
    install.packages(instalador, dependencies = T)
    break()}
  sapply(pacs, require, character = T) 
} else {
  sapply(pacs, require, character = T) 
}
##               plotly            tidyverse              ggrepel 
##                 TRUE                 TRUE                 TRUE 
##          fastDummies                knitr           kableExtra 
##                 TRUE                 TRUE                 TRUE 
##              splines             reshape2 PerformanceAnalytics 
##                 TRUE                 TRUE                 TRUE 
##          correlation                  see               ggraph 
##                 TRUE                 TRUE                 TRUE 
##                psych              nortest                  rgl 
##                 TRUE                 TRUE                 TRUE 
##                  car               ggside            tidyquant 
##                 TRUE                 TRUE                 TRUE 
##                olsrr               jtools             ggstance 
##                 TRUE                 TRUE                 TRUE 
##               magick              cowplot            emojifont 
##                 TRUE                 TRUE                 TRUE 
##                beepr                 Rcpp         equatiomatic 
##                 TRUE                 TRUE                 TRUE

REGRESSÃO LINEAR SIMPLES

Introduction

In linear regression, the aim is to model the relationship between a dependent variable and one or more explanatory variables. In simple linear regression, there is just one explanatory variable.


When use linear regression ?
When the outcome of interest is on some sort of continuous scale (for example, quantity, money, height, weight).

Knowing my data Example 1

load(file = "tempodist.RData")
tempodist %>%
  kable() %>%
  kable_styling(bootstrap_options = "striped",
                full_width = F,
                font_size = 22)
tempo distancia
15 8
20 6
20 15
40 20
50 25
25 11
10 5
55 32
35 28
30 20
glimpse(tempodist)
## Rows: 10
## Columns: 2
## $ tempo     <int> 15, 20, 20, 40, 50, 25, 10, 55, 35, 30
## $ distancia <int> 8, 6, 15, 20, 25, 11, 5, 32, 28, 20

The data set is composed by time and distance, both quantitative data.

#Gráfico de dispersão básico
tempo <- c(tempodist[[1]])
distancia <- c(tempodist[[2]])
plot(tempo ~ distancia)

GRÁFICO DE DISPERSÃO

# linha 3 nuvem de pontos, the color is given in hexadecimal scale
# Linha de tendência
ggplotly(
  ggplot(tempodist, aes(x = tempo, y = distancia))+
  geom_point(color = "#E044A7", size = 2.5)+ 
  geom_smooth(aes(color = "Fitted Values"),
                method = "lm", formula = y ~ x, se = F, size = 1) + 
    labs(x = "Distância",
         y = "Tempo",
         title = paste("R²:",
                       round(((cor(tempodist$tempo, tempodist$distancia))^2),4))) +
    scale_color_manual("Legenda:",
                       values = "grey50") +
    theme_classic()
)
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.

Modeling a Simple Linear Regression model

Model Estimation Methods

Estimation Methods are methods through which Regression Analysis is conducted to generate a linear equation based on the data points given on a graph. Generally used estimation methods include Ordinary Least Squares (OLS), Method of Moments (MoM), and Maximum Likelihood Estimate (MLE).

Estimation of the Linear Regression Model by Ordinary Least Squares

What are the rules to proceed the Parameters Estimation (Parameters Estimation Criteria)? We have 2 survival rules 1) Sum of errors equal to zero see why in (https://thestatsgeek.com/2020/03/23/the-mean-of-residuals-in-linear-regression-is-always-zero/) 2) Sum of squared errors being the minimum possible

            Parameters α and β can be estimated by ordinary least squares (OLS) , in which the sum of the squares of the error terms is minimized.
# Estimates the model and generates a list with parameters
model_timedist <- lm(formula = tempo ~ distancia, data = tempodist)

# to observe such parameters we can call summary and it will give us the result at the console
summary(model_timedist)
## 
## Call:
## lm(formula = tempo ~ distancia, data = tempodist)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.6081  -3.9358   0.6419   5.1351   8.6486 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   5.8784     4.5323   1.297 0.230788    
## distancia     1.4189     0.2355   6.025 0.000314 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.719 on 8 degrees of freedom
## Multiple R-squared:  0.8194, Adjusted R-squared:  0.7969 
## F-statistic:  36.3 on 1 and 8 DF,  p-value: 0.0003144

For the model estimation we have that the parameters α = 5.8784 and β = 1.419. The β for the distance is statiscally significant at 95%. However the Intercept(α) is not significant.See a more detailed table of results bellow:

summ(model_timedist, confint = T, digits = 4, ci.width = .95)
Observations 10
Dependent variable tempo
Type OLS linear regression
F(1,8) 36.3031
0.8194
Adj. R² 0.7969
Est. 2.5% 97.5% t val. p
(Intercept) 5.8784 -4.5732 16.3299 1.2970 0.2308
distancia 1.4189 0.8759 1.9620 6.0252 0.0003
Standard errors: OLS
# to export the table do... export_summs(model_timedist, scale = F, digits = 4)

As mentioned above β is statiscally significant, however the Intercept(α) is not. So what should I do?


O fato de α não ser estatísticamente diferente de 0 não permite que o mesmo seja removido igualanlo a 0. Tal procedimento é um erro ERRO pois eliminar o intercepto quando este não for estatisticamente significante gera o viés que pode impactar no beta, e impacta na dispersão de pontos. O prof. deu um exemplo para mostrar essas implicações de quando fazemos isto e nomeou de modelo_errado.

Model View with viewer

Is necessary to generate the equation that fits to this model.

#função 'extract_eq' do pacote 'equatiomatic'
extract_eq(model_timedist, use_coefs = T) %>% 
  kable() %>%
  kable_styling(bootstrap_options = "striped",
                full_width = F,
                font_size = 28
                )
## Registered S3 methods overwritten by 'broom':
##   method            from  
##   tidy.glht         jtools
##   tidy.summary.glht jtools
x
\[ \operatorname{\widehat{tempo}} = 5.88 + 1.42(\operatorname{distancia}) \]

Save fitted value and error in an additional column

tempodist$yhat <- model_timedist$fitted.values
tempodist$erro <- model_timedist$residuals 
tempodist %>%
  select(tempo, distancia, yhat, erro) %>%
  kable() %>%
  kable_styling(bootstrap_options = "striped", 
                full_width = F, 
                font_size = 22)
tempo distancia yhat erro
15 8 17.22973 -2.229730
20 6 14.39189 5.608108
20 15 27.16216 -7.162162
40 20 34.25676 5.743243
50 25 41.35135 8.648649
25 11 21.48649 3.513514
10 5 12.97297 -2.972973
55 32 51.28378 3.716216
35 28 45.60811 -10.608108
30 20 34.25676 -4.256757
Error

For our assumed equation y = mx + c represents the expected mean value of y over many observations of x. For a specific single observation of x, we do not expect y to be precisely on the tendency line because each individual observation will have an associated error term either above or below the expected line. We can determine this error term existent in the fitted model by calculating the difference between the real value of y and the one predicted by our model yhat. For example, at x = 8, our modeled value of yhat is 17.229, but the real value is 15, producing an error of 2,229. These errors are known as the residuals of our model. The residuals for our data set are illustrated at the following chart, as the red line segments.

ggplotly(
  ggplot(tempodist, aes(x = distancia, y = tempo)) +
    geom_point(color = "#39568CFF", size = 2.5) +
    geom_smooth(aes(color = "Fitted Values"),
                method = "lm", formula = y ~ x, se = F, size = 2) +
    geom_hline(yintercept = 30, color = "grey50", size = .5) +
    geom_segment(aes(color = "Ychapéu - Ymédio", x = distancia, xend = distancia,
                     y = yhat, yend = mean(tempo)), size = 0.7, linetype = 2) +
    geom_segment(aes(color = "Erro = Y - Ychapéu", x = distancia, xend = distancia,
                     y = tempo, yend = yhat), size = 0.7, linetype = 3) +
    labs(x = "Distância",
         y = "Tempo") +
    scale_color_manual("Legenda:",
                       values = c("red", "grey50", "green")) +
    theme_classic()
)
Measuring the fit of the model

One way to measure good is your model at explaining the outcome is to compare it to a situation where you have no input and no model at all. In this situation, all you have is your outcome values, which can be considered a random variable with a mean and a variance. In of this exercise the horizontal line representing the mean of y as our ‘random model’‍, and we can calculate the residuals around the mean (green line).

Note that the green line is at the chart above is (ychapeu − ymedio) that represents the deviation of values of the regression model for each observation in relation to the average.

Coefficient confidence (R2)

R2 <- (sum((tempodist$yhat - mean(tempodist$tempo))^2))/
      ((sum((tempodist$yhat - mean(tempodist$tempo))^2)) + (sum((tempodist$erro)^2)))
print(paste0('R2 = ', round(R2, digits = 3)))
## [1] "R2 = 0.819"

The R2 adjusted is the square correlation

# For Adjusted R2
cor(tempodist[1:2])
##               tempo distancia
## tempo     1.0000000 0.9052213
## distancia 0.9052213 1.0000000
Determining the Best Fit

To calculate the best fit linear model for our data we use the lm() function. Once we have run it, the model and all the details will be saved in the work session for further investigation or use. This tells us that that our best fit model (the one that minimizes the average squares of the residuals) is:

#auxiliar model with R² = 100% (para fins didáticos)
#note que aqui o yhat é a variável dependente
modelo_auxiliar <- lm(formula = yhat ~ distancia,
                   data = tempodist)
summary(modelo_auxiliar)
## Warning in summary.lm(modelo_auxiliar): essentially perfect fit: summary may be
## unreliable
## 
## Call:
## lm(formula = yhat ~ distancia, data = tempodist)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -4.885e-15 -6.191e-16  1.057e-15  2.063e-15  2.508e-15 
## 
## Coefficients:
##              Estimate Std. Error   t value Pr(>|t|)    
## (Intercept) 5.878e+00  1.977e-15 2.973e+15   <2e-16 ***
## distancia   1.419e+00  1.027e-16 1.381e+16   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.931e-15 on 8 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 1.908e+32 on 1 and 8 DF,  p-value: < 2.2e-16
#chart showing what would be the perfect fit with R2 = 100%
my_plot <-
  ggplot(tempodist, aes(x = distancia, y = yhat)) +
  geom_point(color = "#39568CFF", size = 5) +
  geom_smooth(aes(color = "Fitted Values"),
              method = "lm", formula = y ~ x, se = F, size = 2) +
  labs(x = "Distância",
       y = "Tempo") +
  scale_color_manual("Legenda:",
                     values = "grey50") +
  theme_cowplot()
my_plot

However the case of study has an R2 = 81,9%, lets come back to it and compare ehat happens at different confidence levels.

confidence interval

At confidence level of 90% :
confint(model_timedist, level = 0.90)  # siginificância 10%
##                    5 %      95 %
## (Intercept) -2.5497025 14.306459
## distancia    0.9810005  1.856837

Note that the interval at 10% of significancy IC = [-2.5497025, 14.306459], contains α = 0.

ggplotly(
  ggplot(tempodist, aes(x = distancia, y = tempo)) +
    geom_point(color = "#8470FF") +
    geom_smooth(aes(color = "Fitted Values"),
                method = lm, formula = y ~ x,
                level = 0.90) + #cria tunel nivel sig 90%
    labs(x = "Distance" ,
         y = "Tempo") +
    scale_color_manual("Legenda:",
                       values = "grey50") +
    theme_bw()
  
)
At confidence level of 95%:
confint(model_timedist, level = 0.95)
##                  2.5 %    97.5 %
## (Intercept) -4.5731877 16.329944
## distancia    0.8758613  1.961977

Note that the interval at 5% of significancy IC = [-4.5731877, 16.329944], contains α = 0.

ggplotly(
  ggplot(tempodist, aes(x = distancia, y = tempo)) +
    geom_point(color = "#0000FF") +
    geom_smooth(aes(color = "Fitted Values"),
                method = "lm", formula = y ~ x,
                level = 0.95) +
    labs(x = "Distância",
         y = "Tempo") +
    scale_color_manual("Legenda:",
                       values = "grey50") +
    theme_bw()
)
At confidence level of 99%:
confint(model_timedist, level = 0.99)
##                  0.5 %    99.5 %
## (Intercept) -9.3293361 21.086093
## distancia    0.6287345  2.209103

Note that the interval at 1% of significancy IC = [-9.3293361, 21.086093], contains α = 0.

ggplotly(
  ggplot(tempodist, aes(x = distancia, y = tempo)) +
    geom_point(color = "#00BFFF") +
    geom_smooth(aes(color = "Fitted Values"),
                method = "lm", formula = y ~ x,
                level = 0.99) +
    labs(x = "Distância",
         y = "Tempo") +
    scale_color_manual("Legenda:",
                       values = "grey50") +
    theme_bw()
)
At confidence level of 99.99%:
confint(model_timedist, level = 0.9999)
##                 0.005 % 99.995 %
## (Intercept) -26.3918115 38.14857
## distancia    -0.2578223  3.09566

Note that the interval at 99,99% of confidence level IC = [-26.3918115, 38.14857], contains α = 0.

ggplotly(
  ggplot(tempodist, aes(x = distancia, y = tempo)) +
    geom_point(color = "#40E0D0") +
    geom_smooth(aes(color = "Fitted Values"),
                method = "lm", formula = y ~ x,
                level = 0.9999) +
    labs(x = "Distância",
         y = "Tempo") +
    scale_color_manual("Legenda:",
                       values = "grey50") +
    theme_bw()
)

Lets try predict something

#Fazendo predições em modelos OLS - e.g.: qual seria o tempo gasto, em média, para
#percorrer a distância de 25km?
answer1 <- predict(object = model_timedist,
        data.frame(distancia = 25))
print(paste0("The average time spent to travel 25km was ", answer1))
## [1] "The average time spent to travel 25km was 41.3513513513514"
#Caso se queira obter as predições com os IC
answer2 <- predict(object = model_timedist,
        data.frame(distancia = 25),
        interval = "confidence", level = 0.95)
print(paste0("The average time spent to travel 25km was ", answer2[[1]], " with a confidence interval of ", answer2[[2]], " to ", answer2[[3]] ))
## [1] "The average time spent to travel 25km was 41.3513513513514 with a confidence interval of 34.803058612367 to 47.8996440903357"

New modeling tatics for solve the issue α = 0

As we could notice during this exercise α = 0 for all confidences interval tested.


No caso de o α não ser signinficante o caminho correto é aumentar o tamanho da amostra? Se for possivel sim, mas se você não conseguir aumentar o tamanho da amostra não há problema algum a análise pode continuar se o β for significante e o α não.

So lets solve it by increasing the sample size

# turn to triple data set
timedist_repeat <- tempodist %>%
  slice(rep(1:n(), each=3))

# calculate the model
modelo_timedist_repeat<- lm(formula = tempo ~ distancia,
                        data = timedist_repeat)
# Descriptive stat
summary(modelo_timedist_repeat)
## 
## Call:
## lm(formula = tempo ~ distancia, data = timedist_repeat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.6081  -4.2568   0.6419   5.6081   8.6486 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   5.8784     2.4226   2.426   0.0219 *  
## distancia     1.4189     0.1259  11.272 6.42e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.22 on 28 degrees of freedom
## Multiple R-squared:  0.8194, Adjusted R-squared:  0.813 
## F-statistic: 127.1 on 1 and 28 DF,  p-value: 6.424e-12

Note that after increase the number of samples the α is now significant

# Calculating the confidence interval
confint(modelo_timedist_repeat, level = 0.95)
##                 2.5 %    97.5 %
## (Intercept) 0.9158438 10.840913
## distancia   1.1610684  1.676769

Now with bigger sample we have an IC = [0.9158438, 10.840913] to α, what means that it does not contains 0 value anymore.

ggplotly(
  ggplot(timedist_repeat, aes(x = distancia, y = tempo)) +
    geom_point(color = "#0000FF") +
    geom_smooth(aes(color = "Fitted Values"),
                method = "lm", formula = y ~ x,
                level = 0.95) +
    labs(x = "Distância",
         y = "Tempo",
         title = "Chart with bigger sample at 95% of confidence level") +
    scale_color_manual("Legenda:",
                       values = "grey50") +
    theme_bw()
)

Note that happened an narrowing at the width of the interval.

Multiple linear regression

The methodology for multiple linear regression is similar to the simple linear regression, but with increased dimensionality. At the Multiple linear regression our inputs are a set of p variables x1, x2,…,xp. So we will have a equation:
y=α+β_1 x_1+β_2 x_2+…+β_p x_p

load(file = "paises.RData")
glimpse(paises)
## Rows: 50
## Columns: 4
## $ pais  <chr> "Argentina", "Australia", "Austria", "Belgium", "Brazil", "Canad…
## $ cpi   <dbl> 3.9, 8.7, 7.9, 7.1, 4.0, 8.9, 6.2, 2.5, 4.0, 6.3, 4.6, 9.3, 2.1,…
## $ idade <int> 72, 64, 72, 67, 59, 61, 70, 49, 79, 58, 42, 76, 59, 70, 66, 60, …
## $ horas <dbl> 35.0, 32.0, 32.0, 30.1, 35.0, 33.4, 34.0, 34.0, 33.0, 32.0, 38.1…

The data base for this exercise is composed by 50 paises and its data related to:
cpi = The Corruption Perceptions Index (CPI) is an index which ranks countries by their perceived levels of public sector corruption.
idade = average age of the billionaries( in dolar) for the respective country.
horas = average of number of works hours, related to the economically active population for the respective country.

summary(paises)
##      pais                cpi            idade           horas      
##  Length:50          Min.   :0.800   Min.   :34.00   Min.   :26.80  
##  Class :character   1st Qu.:2.575   1st Qu.:58.00   1st Qu.:31.40  
##  Mode  :character   Median :3.950   Median :62.00   Median :32.60  
##                     Mean   :4.894   Mean   :60.48   Mean   :32.66  
##                     3rd Qu.:7.475   3rd Qu.:66.75   3rd Qu.:34.40  
##                     Max.   :9.300   Max.   :79.00   Max.   :38.10
#Chart 3D with scatterplot
scatter3d(cpi ~ idade + horas,
          data = paises,
          surface = F,
          point.col = "#440154FF",
          axis.col = rep(x = "black",
                         times = 3))
## Warning: 'rgl::rgl.viewpoint' is deprecated.
## Use 'view3d' instead.
## See help("Deprecated")
## Warning: 'rgl::rgl.bg' is deprecated.
## Use 'bg3d' instead.
## See help("Deprecated")
## Warning: 'rgl::rgl.texts' is deprecated.
## Use 'text3d' instead.
## See help("Deprecated")

## Warning: 'rgl::rgl.texts' is deprecated.
## Use 'text3d' instead.
## See help("Deprecated")

## Warning: 'rgl::rgl.texts' is deprecated.
## Use 'text3d' instead.
## See help("Deprecated")

## Warning: 'rgl::rgl.texts' is deprecated.
## Use 'text3d' instead.
## See help("Deprecated")

## Warning: 'rgl::rgl.texts' is deprecated.
## Use 'text3d' instead.
## See help("Deprecated")

## Warning: 'rgl::rgl.texts' is deprecated.
## Use 'text3d' instead.
## See help("Deprecated")
## Warning: 'rgl::rgl.lines' is deprecated.
## Use 'segments3d' instead.
## See help("Deprecated")

## Warning: 'rgl::rgl.lines' is deprecated.
## Use 'segments3d' instead.
## See help("Deprecated")

## Warning: 'rgl::rgl.lines' is deprecated.
## Use 'segments3d' instead.
## See help("Deprecated")
## Warning: 'rgl::rgl.texts' is deprecated.
## Use 'text3d' instead.
## See help("Deprecated")

## Warning: 'rgl::rgl.texts' is deprecated.
## Use 'text3d' instead.
## See help("Deprecated")

## Warning: 'rgl::rgl.texts' is deprecated.
## Use 'text3d' instead.
## See help("Deprecated")
Correlations at MLR

Structuring a diagram that displays the inter-relation between variables, as well the correlation magnitude between them.

paises %>%
  correlation( method = "pearson") %>%
  plot()

By the diagram is possible to verify an strong positive correlation between idade and cpi, a moderate negative correlation between idade and horas, and a strong negative correlation between horas and cpi.

#distribuições das variáveis, scatters, valores das correlações e suas  respectivas significâncias

chart.Correlation((paises[2:4]), histogram = TRUE)
## Warning in par(usr): argument 1 does not name a graphical parameter

## Warning in par(usr): argument 1 does not name a graphical parameter

## Warning in par(usr): argument 1 does not name a graphical parameter

Generating a Multiple LR model
model_countries <- lm(formula = cpi ~ . -pais, #formula considera todas as variaveis exceto pais
                      data = paises)
Extracting parameters from the model
summary(model_countries)
## 
## Call:
## lm(formula = cpi ~ . - pais, data = paises)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.123 -1.440 -0.316  1.570  4.251 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 11.97192    5.16537   2.318  0.02487 * 
## idade        0.09970    0.03266   3.052  0.00373 **
## horas       -0.40134    0.13467  -2.980  0.00455 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.249 on 47 degrees of freedom
## Multiple R-squared:  0.3239, Adjusted R-squared:  0.2951 
## F-statistic: 11.26 on 2 and 47 DF,  p-value: 0.0001013


More organized results, togheter with confidency intervals can be displayed in the following line:

summ(model_countries, confint = T, digits = 5, ci.width = .95)
Observations 50
Dependent variable cpi
Type OLS linear regression
F(2,47) 11.25669
0.32387
Adj. R² 0.29510
Est. 2.5% 97.5% t val. p
(Intercept) 11.97192 1.58053 22.36331 2.31773 0.02487
idade 0.09970 0.03399 0.16541 3.05219 0.00373
horas -0.40134 -0.67226 -0.13042 -2.98017 0.00455
Standard errors: OLS
Adding fitted values at the data base
paises$cpi_fitted <- model_countries$fitted.values
3D Chart with fitted values
scatter3d(cpi ~ idade + horas,
          data = paises,
          surface = T, fit = "linear",
          point.col = "#00000FFF",
          axis.col = rep(x = "black",
                         times = 3))

Dummy Vars

**** What happens if my variables are Qualitative ? ****

For don`t attribute to them arbitrary values and properly input them at the model is necessary convert them on Dummy variable.

One-hot encoding for 2 attributes

It will generate only one Dummy var D1

load (file = "corrupcao.RData")
glimpse(corrupcao)
## Rows: 50
## Columns: 3
## $ pais   <chr> "Argentina", "Australia", "Austria", "Belgium", "Brazil", "Cana…
## $ cpi    <dbl> 3.9, 8.7, 7.9, 7.1, 4.0, 8.9, 6.2, 2.5, 4.0, 6.3, 4.6, 9.3, 2.1…
## $ regiao <fct> América do Sul, Oceania, Europa, Europa, América do Sul, EUA e …
print(unique(corrupcao[[3]]))
## [1] América do Sul Oceania        Europa         EUA e Canadá   Ásia          
## Levels: América do Sul Oceania Europa EUA e Canadá Ásia

The frequency table taking as reference the column 3 (regiao).

table(corrupcao$regiao)
## 
## América do Sul        Oceania         Europa   EUA e Canadá           Ásia 
##              5              2             24              2             17

Univariate (in this case) Statistics with summary.

summary(corrupcao)
##      pais                cpi                   regiao  
##  Length:50          Min.   :0.800   América do Sul: 5  
##  Class :character   1st Qu.:2.575   Oceania       : 2  
##  Mode  :character   Median :3.950   Europa        :24  
##                     Mean   :4.894   EUA e Canadá  : 2  
##                     3rd Qu.:7.475   Ásia          :17  
##                     Max.   :9.300
#Exploração visual do Corruption Perception Index para cada um dos países
corrupcao %>%
  group_by(regiao) %>%
  mutate(rotulo = paste(pais, cpi)) %>%
  ggplot(aes(x = as.numeric(regiao), y = cpi, label = rotulo)) +
  geom_point(aes(x = regiao, y = cpi), color = "#FDE725FF", alpha = 0.5, size = 5) +
  scale_color_manual("Legenda:",
                     values = "#440154FF") +
  labs(x = "Região",
       y = "Corruption Perception Index") +
  geom_text_repel() +
  theme_bw()

#Exploração visual do Corruption Perception Index para cada um dos países, com valores médios por região
corrupcao %>%
  group_by(regiao) %>%
  mutate(cpi_medio = mean(cpi, na.rm = TRUE)) %>%
  mutate(rotulo = paste(pais, cpi)) %>%
  ggplot(aes(x = as.numeric(regiao), y = cpi, label = rotulo)) +
  geom_point(aes(x = regiao, y = cpi), color = "#FDE725FF", alpha = 0.5, size = 5) +
  geom_line(aes(x = regiao, y = cpi_medio, 
                group = 1, color = "CPI Médio"), linewidth = 1.5) +
  scale_color_manual("Legenda:",
                     values = "#440154FF") +
  labs(x = "Região",
       y = "Corruption Perception Index") +
  geom_text_repel() +
  theme_bw() +
  theme(legend.position = "bottom")

Lets start convert the variable “regiao” into Dummies. The following code does:
a) stabilishes dummies to represent each attribute for “region” at the data set
b)removes the original variable that was converted into dummy. (remove_selected_columns = T)
c) stabilishes as coategory of reference the one that is most frequent, remove_most_frequent_dummy = T.

corrupcao_dummies <- dummy_columns(.data = corrupcao,
                                   select_columns = "regiao",
                                   remove_selected_columns = T,
                                   remove_most_frequent_dummy = T)
Generating the regression model with the dummy vars
model_dummy1 <- lm(cpi~.-pais,
                   data = corrupcao_dummies)
summary(model_dummy1)
## 
## Call:
## lm(formula = cpi ~ . - pais, data = corrupcao_dummies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.1583 -0.9221 -0.0939  1.2739  3.0417 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               6.2583     0.3596  17.406   <2e-16 ***
## `regiao_América do Sul`  -2.0783     0.8659  -2.400   0.0206 *  
## regiao_Ásia              -3.9289     0.5584  -7.036    9e-09 ***
## `regiao_EUA e Canadá`     1.7417     1.2964   1.343   0.1859    
## regiao_Oceania            2.7417     1.2964   2.115   0.0400 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.761 on 45 degrees of freedom
## Multiple R-squared:  0.603,  Adjusted R-squared:  0.5677 
## F-statistic: 17.09 on 4 and 45 DF,  p-value: 1.37e-08
#Plotando o modelo_corrupcao_dummies de forma interpolada
my_plot3 <- 
  corrupcao %>%
  mutate(rotulo = paste(pais, cpi)) %>%
  ggplot(aes(x = as.numeric(regiao), y = cpi, label = rotulo)) +
  geom_point(color = "#66FFFF", alpha = 0.5, size = 4) +
  stat_smooth(aes(color = "Fitted Values"),
              method = "lm", 
              formula = y ~ bs(x, df = 4),
              se = T) +
  labs(x = "Região",
       y = "Corruption Perception Index") +
  scale_x_discrete(labels = c("1" = "América do Sul", 
                              "2" = "Oceania", 
                              "3" = "Europa", 
                              "4" = "EUA e Canadá", 
                              "5" = "Ásia")) +
  scale_color_manual("Legenda:",
                     values = "#FF99FF") +
  geom_text_repel() +
  theme_bw() +
  theme(legend.position = "bottom")
my_plot3
## Warning: The following aesthetics were dropped during statistical transformation: label
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
##   the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
##   variable into a factor?

ggsave("my_plot3.png")
## Saving 7 x 5 in image
## Warning: The following aesthetics were dropped during statistical transformation: label
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
##   the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
##   variable into a factor?
my_plot3 <- image_read("my_plot3.png") #função do pacote 'magick'

gif <- image_read("https://media.giphy.com/media/EyqAY5E3IcwAD3lB3y/giphy.gif")

frames <- image_composite(my_plot3, gif, offset = "+880+30")

animation <- image_animate(frames, fps = 10) #função do pacote 'magick'
image_scale(animation, "x550")

beep("treasure")